Motivation#
Main purpose: understand how well SF’s 911 medical call response system is performing and see where it can be improved.
Key Insights#
received_to_onscene: 64.83% within 10 minutes (code 3)
received_to_onscene: 70.69% within 20 minutes (code 2)
Roughly 80% of the total response time is spent between a unit responding and arriving on scene.
The remaining ~20% is spent actually getting units dispatched.
Outer regions have slightly longer total response times, and the difference is statistically significant.
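The within-threshold shares above are just boolean means over per-call durations. A minimal sketch on toy data (column names mirror the dataset analyzed below; the numbers here are made up):

```python
import pandas as pd

# Toy data: minutes from call received to first ambulance on scene,
# with final_priority '3' (life-threatening) and '2' (non-life-threatening).
calls = pd.DataFrame({
    'received_to_onscene': [4.0, 8.5, 12.0, 15.0, 25.0, 9.0],
    'final_priority':      ['3', '3', '3', '2', '2', '2'],
})

code3 = calls[calls.final_priority == '3']
code2 = calls[calls.final_priority == '2']

# Share of calls meeting each response-time goal
pct_code3 = (code3.received_to_onscene <= 10).mean() * 100
pct_code2 = (code2.received_to_onscene <= 20).mean() * 100
print(f'code 3 within 10 min: {pct_code3:.2f}%')
print(f'code 2 within 20 min: {pct_code2:.2f}%')
```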
Recommendations#
Investigate how ambulances are being dispatched: are they taking optimal routes? What are the biggest blockers to them getting to the scene?
Investigate what takes so long for available units to be dispatched once a call is received. Is there good communication between DEM and SFFD?
Table of contents#
Background#
The life cycle of a 911 call: source
According to the 2023 SF Annual Performance report, SFFD responds to 80-90% of 911 calls requiring an ambulance, and their goal is to respond in under 10 minutes for code 3 calls (life-threatening) and under 20 minutes for code 2 calls (non-life-threatening). Their definition of response time is the “Roll/Response” time (see above, source), which is only a portion of the lifecycle of a 911 call. However, if I’m the caller, I ultimately care about the entire duration between initiating the call and when an ambulance arrives - the “Call Duration” time (see above). This is the metric that I analyze in this project.
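The difference between the two metrics is just which timestamps you subtract. A minimal sketch with one hypothetical call (column names match the dataset used later):

```python
import pandas as pd

# One hypothetical call: received at 10:00, dispatched at 10:03, on scene at 10:11.
calls = pd.DataFrame({
    'received_dttm': pd.to_datetime(['2024-01-01 10:00:00']),
    'dispatch_dttm': pd.to_datetime(['2024-01-01 10:03:00']),
    'on_scene_dttm': pd.to_datetime(['2024-01-01 10:11:00']),
})

# "Roll/Response" (SFFD's reported metric): dispatch -> on scene
roll_response = (calls.on_scene_dttm - calls.dispatch_dttm).dt.total_seconds() / 60
# "Call Duration" (the metric analyzed here): received -> on scene
call_duration = (calls.on_scene_dttm - calls.received_dttm).dt.total_seconds() / 60
print(roll_response.iloc[0], call_duration.iloc[0])  # 8.0 11.0
```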
Data sources#
Call duration breakdown#
show joint plots
plots by hour of day, day of week
Exploring geographies for responding#
Exploring dispatching#
most aren’t late
diversions aren’t an issue (seemingly)
Data intro / cleaning#
show filters, and what percent of data this knocks out
show calculation of dates, percentages
high level#
import plotly.io as pio
pio.renderers.default = 'notebook'
from pylab import rcParams
rcParams['figure.figsize'] = 13, 6
import pandas as pd
raw = pd.read_parquet('raw_data_analysis/08122024-5yr.parquet')
print(f'raw shape: {raw.shape}')
df = raw[raw.call_date.dt.year >= 2021].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
print(f'2021 onwards shape: {df.shape}')
raw shape: (1626986, 34)
2021 onwards shape: (1212862, 34)
df.loc[:,'on_scene_hour'] = df.on_scene_dttm.dt.hour
df.loc[:,'on_scene_day_of_week'] = df.on_scene_dttm.dt.dayofweek
from sklearn.preprocessing import StandardScaler
def add_norm_lat_long(df):
    lats, longs, slats, slongs = [], [], [], []
    for coord in df.case_location.values:
        if coord and 'coordinates' in coord:
            long, lat = coord['coordinates'][0], coord['coordinates'][1]
            slongs.append(long)
            slats.append(lat)
            longs.append(float(long))
            lats.append(float(lat))
        else:
            slongs.append(None)
            slats.append(None)
            longs.append(pd.NA)
            lats.append(pd.NA)
    df.loc[:, 'lat'] = lats
    df.loc[:, 'long'] = longs
    df.loc[:, 'slat'] = slats
    df.loc[:, 'slong'] = slongs
    df = df[df.lat.notnull() & df.long.notnull()]
    lat_long_scalar = StandardScaler()
    norm_lat_long = lat_long_scalar.fit_transform(df[['lat', 'long']])
    norm_lat_long_df = pd.DataFrame(norm_lat_long, columns=['norm_lat', 'norm_long'])
    norm_lat_long_df.index = df.index
    df_normed_lat_long = pd.concat([df, norm_lat_long_df], axis=1)
    return lat_long_scalar, df_normed_lat_long
lat_long_scalar, df = add_norm_lat_long(df)
df = df[df.lat.notnull() & df.long.notnull()]
df.unit_type.value_counts(normalize=True)
unit_type
ENGINE 0.317043
MEDIC 0.304962
TRUCK 0.096494
PRIVATE 0.081911
CHIEF 0.071156
SUPPORT 0.044188
RESCUE CAPTAIN 0.036955
CP 0.025088
RESCUE SQUAD 0.013243
BLS 0.007993
INVESTIGATION 0.000877
AIRPORT 0.000089
Name: proportion, dtype: float64
df['received_to_dispatch'] = (df.dispatch_dttm - df.received_dttm).dt.total_seconds() / 60
df['dispatch_to_response'] = (df.response_dttm - df.dispatch_dttm).dt.total_seconds() / 60
df['response_to_onscene'] = (df.on_scene_dttm - df.response_dttm).dt.total_seconds() / 60
df['received_to_onscene'] = (df.on_scene_dttm - df.received_dttm).dt.total_seconds() / 60
df = df[(df.received_to_dispatch >= 0) & (df.dispatch_to_response >= 0) & (df.response_to_onscene >= 0)]
df.shape
(961275, 46)
import seaborn as sns
import matplotlib.pyplot as plt
tmp = df[df.unit_type == 'MEDIC']
tmp = tmp.loc[tmp.groupby('incident_number')['dispatch_dttm'].idxmin()]
ax = sns.boxplot(tmp, y='received_to_onscene', hue='final_priority', hue_order=['2', '3'], showfliers=False)
# Retrieve the lines that make up the boxplot
for i in range(2):
    # Get the whiskers and box information
    # Lines are ordered as: whisker_min, whisker_max, box_min, median, box_max
    whisker_low = ax.lines[i * 5].get_ydata()[1]
    whisker_high = ax.lines[i * 5 + 1].get_ydata()[1]
    q1 = ax.lines[i * 5].get_ydata()[0]
    median = ax.lines[i * 5 + 4].get_ydata()[0]
    q3 = ax.lines[i * 5 + 1].get_ydata()[0]
    # Get the center position for the label
    x_value = ax.lines[i * 5 + 3].get_xdata().mean()
    # Label each part with its value
    ax.text(x_value, median, f'{median:.2f}', ha='center', va='center', fontsize=10, fontweight='bold')
    ax.text(x_value, q1, f'{q1:.2f}', ha='center', va='center', fontsize=10, fontweight='bold')
    ax.text(x_value, q3, f'{q3:.2f}', ha='center', va='center', fontsize=10, fontweight='bold')
    ax.text(x_value, whisker_low, f'{whisker_low:.2f}', ha='center', va='center', fontsize=10, fontweight='bold')
    ax.text(x_value, whisker_high, f'{whisker_high:.2f}', ha='center', va='center', fontsize=10, fontweight='bold')
# Show the plot
plt.title('Time until first ambulance arrival (minutes)')
plt.ylabel('Minutes')
plt.show()
tmp = df[df.unit_type == 'MEDIC']
tmp = tmp.loc[tmp.groupby('incident_number')['dispatch_dttm'].idxmin()]
medic = tmp[(tmp.received_to_onscene >= 10) & (tmp.received_to_onscene < 120)].copy()  # .copy() avoids SettingWithCopyWarning
print('% of MEDIC data kept: {0:.2f}%'.format(len(medic) * 100 / len(df[df.unit_type == 'MEDIC'])))
print(medic.shape)
medic.unit_id.nunique()
% of MEDIC data kept: 54.91%
(172156, 46)
104
medic.final_priority.value_counts(normalize=True)
final_priority
2 0.665193
3 0.334807
Name: proportion, dtype: float64
fig, axs = plt.subplots(1, 2)
sns.histplot(medic, x='received_to_onscene', ax=axs[0], hue='final_priority')
axs[0].set_title('Raw on scene duration')
sns.histplot(medic, x='received_to_onscene', log_scale=True, ax=axs[1], hue='final_priority')
axs[1].set_title('log(on scene duration)')
Text(0.5, 1.0, 'log(on scene duration)')
import numpy as np
medic['received_to_dispatch_perc'] = medic.received_to_dispatch / medic.received_to_onscene
medic['dispatch_to_response_perc'] = medic.dispatch_to_response / medic.received_to_onscene
medic['response_to_onscene_perc'] = medic.response_to_onscene / medic.received_to_onscene
medic['received_to_dispatch_log'] = np.log(medic.received_to_dispatch)
medic['dispatch_to_response_log'] = np.log(medic.dispatch_to_response)
medic['response_to_onscene_log'] = np.log(medic.response_to_onscene)
medic['received_to_onscene_log'] = np.log(medic.received_to_onscene)
df.loc[:,'received_hour'] = df.received_dttm.dt.hour
medic.loc[:,'received_hour'] = medic.received_dttm.dt.hour
df.loc[:,'received_day_of_week'] = df.received_dttm.dt.dayofweek
medic.loc[:,'received_day_of_week'] = medic.received_dttm.dt.dayofweek
medic_onscene_hr = medic.groupby('received_hour').received_to_onscene.mean().to_frame().reset_index()
tmp = df.groupby('call_number').received_hour.first().to_frame().reset_index()
df_call_count = tmp.groupby('received_hour').size().to_frame(name='count').reset_index()
fig, ax1 = plt.subplots()
# Plot the first line plot
sns.lineplot(x='received_hour', y='received_to_onscene', data=medic_onscene_hr, ax=ax1, color='b',
label='Mean response duration')
# Create a second y-axis with twinx
ax2 = ax1.twinx()
# Plot the second line plot on the second y-axis
sns.lineplot(x='received_hour', y='count', data=df_call_count, ax=ax2, color='r', label='# calls received')
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')
ax2.get_legend().remove()
# Add labels for both y-axes
ax1.set_ylabel('Ambulance response duration (min)', color='b')
ax2.set_ylabel('# calls received', color='r')
ax1.set_xlabel('Hour that call was received')
plt.title('Calls received and response times by hour of day')
# Show the plot
plt.show()
medic_onscene_day = medic.groupby('received_day_of_week').received_to_onscene.mean().to_frame().reset_index()
tmp = df.groupby('call_number').received_day_of_week.first().to_frame().reset_index()
df_call_count = tmp.groupby('received_day_of_week').size().to_frame(name='count').reset_index()
fig, ax1 = plt.subplots()
# Plot the first line plot
sns.lineplot(x='received_day_of_week', y='received_to_onscene', data=medic_onscene_day, ax=ax1, color='b',
label='Mean response duration')
# Create a second y-axis with twinx
ax2 = ax1.twinx()
# Plot the second line plot on the second y-axis
sns.lineplot(x='received_day_of_week', y='count', data=df_call_count, ax=ax2, color='r', label='# calls received')
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')
ax2.get_legend().remove()
# Add labels for both y-axes
ax1.set_ylabel('Ambulance response duration (min)', color='b')
ax2.set_ylabel('# calls received', color='r')
ax1.set_xlabel('Day of week')
ax1.set_xticks(range(7))
ax1.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('Calls received and response times by day of week')
# Show the plot
plt.show()
sns.jointplot(medic.sample(frac=0.2, random_state=0),
x='received_to_dispatch_perc', y='received_to_onscene_log',
hue='final_priority', kind="kde", hue_order=['2', '3']);
plt.suptitle("% duration(received -> dispatch) vs log(response duration)");
plt.xlabel('% time spent to dispatch');
plt.ylabel('log (response duration)');
plt.tight_layout();
sns.jointplot(medic.sample(frac=0.2, random_state=0),
x='dispatch_to_response_perc', y='received_to_onscene_log',
kind='kde', hue='final_priority', hue_order=['2', '3']);
plt.suptitle("% duration(dispatch -> response) vs log(response duration)");
plt.xlabel('% duration(dispatch -> response)');
plt.ylabel('log (response duration)');
plt.tight_layout();
sns.jointplot(medic.sample(frac=0.2, random_state=0),
x='response_to_onscene_perc', y='received_to_onscene_log',
kind='kde', hue='final_priority', hue_order=['2', '3']);
plt.suptitle("% duration(response -> on scene) vs log(response duration)");
plt.xlabel('% duration(response -> onscene)');
plt.ylabel('log (response duration)');
plt.tight_layout();
medic_sorted = medic.sort_values('received_dttm')
medic_sorted['prev_avail'] = None
for unit_id in medic_sorted.unit_id.unique():
    unit_filter = medic_sorted.unit_id == unit_id
    index_to_set = medic_sorted[unit_filter].iloc[1:].index
    prev_avail = medic_sorted[unit_filter].iloc[:-1].available_dttm
    prev_avail.index = index_to_set
    medic_sorted.loc[unit_filter, 'prev_avail'] = prev_avail
medic_sorted.prev_avail = pd.to_datetime(medic_sorted.prev_avail)
medic_sorted['prev_rest_min'] = (medic_sorted.received_dttm - medic_sorted.prev_avail).dt.total_seconds() / 60
tmp = medic_sorted.sample(frac=0.01, random_state=0)
sns.jointplot(tmp[(tmp.prev_rest_min < 2500) & (tmp.prev_rest_min > -50)],
x='prev_rest_min', y='received_to_dispatch_perc', hue='final_priority', hue_order=['2', '3'],
kind='kde')
<seaborn.axisgrid.JointGrid at 0x2ba3c2f10>
Next steps: compare prev_avail to see whether long on-scene durations are an issue, and look further into the received-to-dispatch time. Failing that, try to locate problematic geographical areas to see whether resource allocation is an issue, and quantify how much improvement is needed to hit the KPI.
sum(medic_sorted.prev_rest_min < 0) / len(medic_sorted)
medic_sorted['avail_when_received'] = medic_sorted.prev_rest_min > 0
sns.boxplot(medic_sorted, y='received_to_dispatch_perc', hue='avail_when_received');
plt.title("% duration(received -> dispatch) for busy and available ambulances");
plt.ylabel('% duration(received -> dispatch)');
About 0.01% of the data is inconsistent: prev_avail is later than response_dttm, which looks like a data error. Regardless, the conclusion is that when a unit is not available at the time a call is received (8% of the time), the mean share of time spent waiting for dispatch is about 50% greater, and the 75th percentile is twice as large.
There is also a lot of outlier data in the received-to-dispatch durations; that may also be due to data errors.
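The busy-vs-available comparison boils down to a groupby over the availability flag. A minimal sketch with made-up numbers (the real columns are `avail_when_received` and `received_to_dispatch_perc`):

```python
import pandas as pd

# Toy data: share of total response time spent waiting for dispatch,
# split by whether the unit was available when the call was received.
units = pd.DataFrame({
    'avail_when_received':       [True, True, True, False, False],
    'received_to_dispatch_perc': [0.10, 0.12, 0.14, 0.18, 0.20],
})
stats = units.groupby('avail_when_received').received_to_dispatch_perc.agg(['mean', 'median'])
print(stats)
```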
Now it's time to see whether travel distance is worse for certain areas.
def dedup_calls(data, condition=None):
    if condition is not None:
        filtered = data[condition]
    else:
        filtered = data
    df = filtered.groupby(['slong', 'slat']).agg({
        'lat': 'mean',
        'long': 'mean',
        'norm_lat': ['mean', 'count'],
        'norm_long': 'mean',
        'received_to_onscene': 'mean'
    })
    df.columns = [' '.join(col).strip() for col in df.columns.values]
    df = df.rename(columns={'norm_lat mean': 'norm_lat', 'norm_lat count': 'count', 'norm_long mean': 'norm_long',
                            'lat mean': 'lat', 'long mean': 'long', 'received_to_onscene mean': 'received_to_onscene'})
    # Weight each location by call count * mean response duration
    df['count'] *= df['received_to_onscene']
    return df
from sklearn import cluster
x = dedup_calls(medic_sorted)
print(x.shape)
km = cluster.KMeans(n_clusters=50, random_state=0)
preds = pd.Series(km.fit_predict(x[['norm_lat', 'norm_long']], sample_weight=x['count'])).to_frame(name='cluster')
preds.index = x.index
ax = sns.scatterplot(pd.concat([x, preds], axis=1), x='norm_long', y='norm_lat', hue='cluster', palette=sns.color_palette("hls", 50));
ax.get_legend().remove()
(10237, 6)
medic_sorted_clustered = pd.merge(medic_sorted, preds, on=['slong', 'slat'])
assert len(medic_sorted) == len(medic_sorted_clustered)
assert medic_sorted.shape[1] + 1 == medic_sorted_clustered.shape[1]
cluster_metrics = medic_sorted_clustered.groupby('cluster').response_to_onscene.mean()
ordered_clusters = list((cluster_metrics).sort_values().index)
sns.catplot(medic_sorted_clustered, x='cluster', y='response_to_onscene', kind="box", log_scale=True,
height=5, aspect=2, order=ordered_clusters)
<seaborn.axisgrid.FacetGrid at 0x2bbb0f810>
sns.scatterplot(medic_sorted_clustered[medic_sorted_clustered.cluster.isin(ordered_clusters[-20:])],
x='norm_long', y='norm_lat', hue='cluster');
import geopandas as gpd
nhoods = gpd.read_file('raw_data_analysis/Analysis Neighborhoods.geojson')
stations = pd.read_csv('raw_data_analysis/stations.csv')
hospitals = pd.read_csv('raw_data_analysis/hospitals.csv')
import plotly.graph_objs as go
import numpy as np
def show_clusters_i(data, ordered_clusters):
    sample = data.copy().sample(frac=0.2, random_state=0)

    # Functions to pull coordinates for a given cluster dynamically
    def get_lats(cluster):
        return sample[sample.cluster == cluster].lat

    def get_longs(cluster):
        return sample[sample.cluster == cluster].long

    # Create a separate trace for each cluster
    traces = []
    max_clusters = 20
    for cluster in list(reversed(ordered_clusters))[:max_clusters]:
        traces.append(go.Scatter(
            x=get_longs(cluster),
            y=get_lats(cluster),
            mode='markers',
            name=f'Cluster {cluster}',
            visible=False  # Initially, all traces are hidden
        ))
    for trace in traces[:10]:
        trace.visible = True

    def get_polygon_coords(geometry):
        """Extract x, y coordinates from a Polygon or MultiPolygon."""
        if geometry.geom_type == 'Polygon':
            # For a single Polygon, get the exterior coordinates
            x, y = geometry.exterior.xy
            return [(list(x), list(y))]  # A list with one tuple of (x, y) coordinates
        elif geometry.geom_type == 'MultiPolygon':
            # For a MultiPolygon, get coordinates for each Polygon's exterior
            polygons_coords = []
            for polygon in geometry.geoms:  # Iterate over each Polygon in the MultiPolygon
                x, y = polygon.exterior.xy
                polygons_coords.append((list(x), list(y)))
            return polygons_coords
        return []  # In case there are no coordinates

    # Create scatter traces for each neighborhood polygon
    polygon_traces = []
    for i, row in nhoods.iterrows():
        geometry = row['geometry']
        nhood_name = row['nhood']
        coords = get_polygon_coords(geometry)
        if geometry.geom_type == 'Polygon':
            # For simple polygons, add one trace
            x, y = coords[0]
            polygon_traces.append(go.Scatter(
                x=x,
                y=y,
                name=nhood_name,
                line=dict(color='black'),
                opacity=0.5
            ))
        elif geometry.geom_type == 'MultiPolygon':
            # For multipolygons, add multiple traces
            for x, y in coords:
                polygon_traces.append(go.Scatter(
                    x=x,
                    y=y,
                    name=nhood_name,
                    line=dict(color='black'),
                    opacity=0.5
                ))

    # Create steps for the slider, which controls the visibility of the clusters
    steps = []
    for i in range(max_clusters):  # The slider goes from 1 to 20 clusters
        step = dict(
            method='update',
            args=[{'visible': [False] * len(traces + polygon_traces)},  # Hide all traces initially
                  {'title': f'Top {i+1} areas with largest mean response duration'}],
            label=f'{i + 1}'
        )
        step['args'][0]['visible'][:i+1] = [True] * (i+1)
        step['args'][0]['visible'][max_clusters:] = [True] * len(polygon_traces)
        steps.append(step)

    # Create the slider
    sliders = [dict(
        active=9,
        currentvalue={"prefix": "Number of clusters: "},
        pad={"t": 50},
        steps=steps
    )]

    # Layout with the slider
    layout = go.Layout(
        height=800,
        title="Top 10 areas with largest mean response duration",
        sliders=sliders,
        xaxis=dict(
            title='longitude',
            range=[-122.52, -122.32],
        ),
        yaxis=dict(
            title='latitude',
            range=[37.7, 37.85],
        ),
        showlegend=False
    )

    # Create the figure with the cluster scatter traces and neighborhood polygons
    fig = go.Figure(layout=layout)
    for trace in traces:
        fig.add_trace(trace)
    for trace in polygon_traces:
        fig.add_trace(trace)

    # fig.write_html("scatter_dynamic_polygons.html")  # Export to HTML for web embedding
    fig.show()
show_clusters_i(medic_sorted_clustered, ordered_clusters)
Conclusion: the inner clusters have a worse mean time from response to on scene. Data quality could play a role, but the data points to driving time as the biggest factor contributing to longer response times - it's not that units are too preoccupied. That suggests better resource allocation, better procedures for getting to the scene, or simply clearer roads that let units drive faster would help, especially in the inner clusters.
When not weighting locations by frequency of calls, the outer clusters come out worst.
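The weighted-vs-unweighted contrast comes from KMeans's `sample_weight`: weighting each location by call volume pulls cluster centers toward busy areas. A minimal sketch on synthetic coordinates (not the real data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))
# Pretend the left half of the map is "busy": 10x the call volume.
weights = np.where(coords[:, 0] < 0, 10.0, 1.0)

km_unweighted = KMeans(n_clusters=4, random_state=0, n_init=10).fit(coords)
km_weighted = KMeans(n_clusters=4, random_state=0, n_init=10).fit(coords, sample_weight=weights)

# The weighted fit places its centers differently, typically drawn toward the busy side.
print(km_unweighted.cluster_centers_[:, 0].mean())
print(km_weighted.cluster_centers_[:, 0].mean())
```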
from sklearn import cluster
def display_clusters_by_onscene_time(data, condition):
    data = data.copy()
    data = data[condition]
    x = dedup_calls(data)
    km = cluster.KMeans(n_clusters=50, random_state=0)
    preds = pd.Series(km.fit_predict(x[['norm_lat', 'norm_long']], sample_weight=x['count'])).to_frame(name='cluster')
    preds.index = x.index
    data_clustered = pd.merge(data, preds, on=['slong', 'slat'])
    cluster_metrics = data_clustered.groupby('cluster').response_to_onscene.mean()
    ordered_clusters = list(cluster_metrics.sort_values().index)
    # sns.catplot(data_clustered, x='cluster', y='response_to_onscene', kind="box", log_scale=True,
    #             height=5, aspect=2, order=ordered_clusters)
    show_clusters_i(data_clustered, ordered_clusters)
display_clusters_by_onscene_time(medic_sorted, medic_sorted.final_priority == '3')
display_clusters_by_onscene_time(medic_sorted, medic_sorted.final_priority == '2')
from scipy.stats import permutation_test
def run_permutation_test(data, condition):
    data = data.copy()
    data = data[condition]
    x = dedup_calls(data)
    km = cluster.KMeans(n_clusters=50, random_state=0)
    preds = pd.Series(km.fit_predict(x[['norm_lat', 'norm_long']], sample_weight=x['count'])).to_frame(name='cluster')
    preds.index = x.index
    data_clustered = pd.merge(data, preds, on=['slong', 'slat'])
    cluster_metrics = data_clustered.groupby('cluster').response_to_onscene.mean()
    ordered_clusters = list(cluster_metrics.sort_values().index)
    data_clustered['worst_clusters'] = data_clustered.cluster.isin(ordered_clusters[-20:])
    sampled = data_clustered.sample(frac=0.1, random_state=0)
    group1 = sampled[sampled.worst_clusters].response_to_onscene
    group2 = sampled[~sampled.worst_clusters].response_to_onscene
    # Run the permutation test
    result = permutation_test((group1, group2),
                              statistic=lambda x, y: np.mean(x) - np.mean(y),
                              n_resamples=10000,  # Monte Carlo approximation with 10,000 resamples
                              alternative='greater')
    print(result)
# run_permutation_test(medic_sorted, medic_sorted.final_priority == '2')
# run_permutation_test(medic_sorted, medic_sorted.final_priority == '3')
Back to investigating what’s correlated with prev_rest_min < 0
medic_sorted.loc[:, 'has_transport'] = medic_sorted.transport_dttm.notnull()
medic_sorted['prev_has_transport'] = None
for unit_id in medic_sorted.unit_id.unique():
    unit_filter = medic_sorted.unit_id == unit_id
    index_to_set = medic_sorted[unit_filter].iloc[1:].index
    prev_has_transport = medic_sorted[unit_filter].iloc[:-1].has_transport
    prev_has_transport.index = index_to_set
    medic_sorted.loc[unit_filter, 'prev_has_transport'] = prev_has_transport
tmp = medic_sorted.sample(frac=0.2, random_state=0)
sns.jointplot(tmp[(tmp.prev_rest_min < 2500) & (tmp.prev_rest_min > -50)],
x='prev_rest_min', y='received_to_dispatch_perc', hue='prev_has_transport',
kind='kde')
<seaborn.axisgrid.JointGrid at 0x12e7f7bd0>
Investigating diversions#
# %load_ext autoreload
# %autoreload 2
# from utils import get_hospital_diversions
diversions = pd.read_parquet('raw_data_analysis/hospital_diversions.parquet')
diversions.head()
| | hospital_full | hospital_abbrev | diversion_category | started | ended | duration_in_minutes |
|---|---|---|---|---|---|---|
| 0 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-04-30 14:35:00 | 2024-04-30 14:44:00 | 9.0 |
| 1 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-04-30 20:31:00 | 2024-04-30 22:27:00 | 116.0 |
| 2 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 09:18:00 | 2024-05-01 11:18:00 | 120.0 |
| 3 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 11:21:00 | 2024-05-01 11:28:00 | 7.0 |
| 4 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 15:27:00 | 2024-05-01 17:27:00 | 120.0 |
diversions.shape
(12181, 6)
diversions.duration_in_minutes = diversions.duration_in_minutes.astype(float)
diversions.duration_in_minutes.plot.hist(bins=100)
<Axes: ylabel='Frequency'>
diversions[diversions.duration_in_minutes < 60]
| | hospital_full | hospital_abbrev | diversion_category | started | ended | duration_in_minutes |
|---|---|---|---|---|---|---|
| 0 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-04-30 14:35:00 | 2024-04-30 14:44:00 | 9.0 |
| 3 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 11:21:00 | 2024-05-01 11:28:00 | 7.0 |
| 5 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 17:29:00 | 2024-05-01 18:22:00 | 53.0 |
| 6 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-01 22:21:00 | 2024-05-01 22:32:00 | 11.0 |
| 7 | California Pacific Medical Center - Van Ness | CPMC - Van Ness | ED | 2024-05-02 02:32:00 | 2024-05-02 02:44:00 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 12174 | Chinese Hospital | Chinese | ED | 2024-03-16 12:25:00 | 2024-03-16 12:27:00 | 2.0 |
| 12175 | Chinese Hospital | Chinese | ED | 2024-03-18 10:15:00 | 2024-03-18 10:17:00 | 2.0 |
| 12177 | Chinese Hospital | Chinese | ED | 2024-03-19 16:01:00 | 2024-03-19 16:03:00 | 2.0 |
| 12178 | Chinese Hospital | Chinese | ED | 2024-03-29 12:48:00 | 2024-03-29 13:43:00 | 55.0 |
| 12179 | Chinese Hospital | Chinese | ED | 2024-04-02 11:40:00 | 2024-04-02 11:42:00 | 2.0 |
5000 rows × 6 columns
# Define the range of hours you want to analyze
all_periods = pd.date_range(start=diversions['started'].min().floor('30min'),
                            end=diversions['ended'].max().ceil('30min'), freq='30min')
# Create a DataFrame to store the number of hospitals in diversion each hour
diversion_counts = pd.DataFrame({'period': all_periods})
# For each hour, count the number of hospitals that are in diversion
diversion_counts['diversion_count'] = diversion_counts['period'].apply(
lambda h: np.sum((diversions['started'] <= h) & (diversions['ended'] > h))
)
diversion_counts.head()
| | period | diversion_count |
|---|---|---|
| 0 | 2023-01-01 02:00:00 | 0 |
| 1 | 2023-01-01 02:30:00 | 1 |
| 2 | 2023-01-01 03:00:00 | 1 |
| 3 | 2023-01-01 03:30:00 | 1 |
| 4 | 2023-01-01 04:00:00 | 1 |
medic_sorted['received_30T'] = medic_sorted.received_dttm.dt.floor('30min')
medic_sorted['received_30T_prev'] = medic_sorted.received_30T - pd.Timedelta(minutes=30)
medic_sorted_with_diversions = medic_sorted.merge(diversion_counts, how='inner',
left_on='received_30T', right_on='period')
medic_sorted_with_diversions = medic_sorted_with_diversions.merge(diversion_counts, how='inner',
left_on='received_30T_prev', right_on='period',
suffixes=('_curr', '_prev'))
medic_sorted_with_diversions.head()
| | id | call_number | unit_id | incident_number | call_type | call_date | watch_date | received_dttm | entry_dttm | dispatch_dttm | ... | prev_rest_min | avail_when_received | has_transport | prev_has_transport | received_30T | received_30T_prev | period_curr | diversion_count_curr | period_prev | diversion_count_prev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 230010486-66 | 230010486 | 66 | 23000085 | Medical Incident | 2023-01-01 | 2022-12-31 | 2023-01-01 02:35:05 | 2023-01-01 02:35:05 | 2023-01-01 03:04:46 | ... | -15.400000 | False | False | True | 2023-01-01 02:30:00 | 2023-01-01 02:00:00 | 2023-01-01 02:30:00 | 1 | 2023-01-01 02:00:00 | 0 |
| 1 | 230010507-60 | 230010507 | 60 | 23000091 | Medical Incident | 2023-01-01 | 2022-12-31 | 2023-01-01 02:36:52 | 2023-01-01 02:40:11 | 2023-01-01 02:40:33 | ... | 147.883333 | True | True | True | 2023-01-01 02:30:00 | 2023-01-01 02:00:00 | 2023-01-01 02:30:00 | 1 | 2023-01-01 02:00:00 | 0 |
| 2 | 230010499-68 | 230010499 | 68 | 23000088 | Medical Incident | 2023-01-01 | 2022-12-31 | 2023-01-01 02:38:43 | 2023-01-01 02:38:43 | 2023-01-01 02:55:25 | ... | -16.633333 | False | False | True | 2023-01-01 02:30:00 | 2023-01-01 02:00:00 | 2023-01-01 02:30:00 | 1 | 2023-01-01 02:00:00 | 0 |
| 3 | 230010515-77 | 230010515 | 77 | 23000093 | Medical Incident | 2023-01-01 | 2022-12-31 | 2023-01-01 02:39:25 | 2023-01-01 02:42:10 | 2023-01-01 02:43:38 | ... | 6.616667 | True | False | True | 2023-01-01 02:30:00 | 2023-01-01 02:00:00 | 2023-01-01 02:30:00 | 1 | 2023-01-01 02:00:00 | 0 |
| 4 | 230010525-64 | 230010525 | 64 | 23000094 | Medical Incident | 2023-01-01 | 2022-12-31 | 2023-01-01 02:43:16 | 2023-01-01 02:45:28 | 2023-01-01 02:57:10 | ... | 13.150000 | True | True | True | 2023-01-01 02:30:00 | 2023-01-01 02:00:00 | 2023-01-01 02:30:00 | 1 | 2023-01-01 02:00:00 | 0 |
5 rows × 66 columns
sns.jointplot(medic_sorted_with_diversions.sample(frac=0.2, random_state=0), x='diversion_count_curr',
y='received_to_dispatch_perc', kind='kde')
<seaborn.axisgrid.JointGrid at 0x2bb922d50>
sns.jointplot(medic_sorted_with_diversions.sample(frac=0.2, random_state=0), x='diversion_count_prev',
y='received_to_dispatch_perc', kind='kde')
<seaborn.axisgrid.JointGrid at 0x2bba2dd50>
sns.jointplot(medic_sorted_with_diversions.sample(frac=0.2, random_state=0), x='diversion_count_prev',
y='received_to_onscene_log', kind='kde')
<seaborn.axisgrid.JointGrid at 0x2c6e7fcd0>